Citizen science projects like iNaturalist are becoming more credible as sources of research-grade data. While evasive species can create biases in absolute counts of observations, studies concerning within-species relative abundance may sidestep these issues by assuming that such reporting biases are uniform in time. Here we consider the relative abundance of 47 species of birds in Northern California over the 2019 calendar year. The number of observations for a given species per week is obtained from iNaturalist, accessed from their API using the R package spocc. Seasonality curves are then constructed from iNaturalist observations using techniques from kernel density estimation. The resulting seasonal trends may then be viewed as a sample of random curves (more specifically, probability densities) and can be studied under the framework of Functional Data Analysis. We will use functional principal component analysis to model the relative abundance of these species and identify a low-dimensional latent geometry which can be used to group similar bird species by their patterns of seasonality.
For the \(i^{th}\) species during the \(j^{th}\) week, we observe the number of observations reported on iNaturalist, \(x_{ij},~i=1,...,47,~j=1,...,52\). Using a Gaussian kernel \(K(u) = \dfrac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2}\) we obtain estimates for the seasonality curves given by:
\[f_i(t) = \dfrac{1}{52h}\sum_{j=1}^{52}K(\dfrac{t-x_{ij}}{h}),\]
where the bandwidth \(h\) is a tuning parameter which modulates smoothness of curves and is chosen here to be \(6\) weeks. The estimated densities for the 47 birds are shown below.
The bold black curve denotes the pointwise mean seasonality, which is roughly constant with a slight skew toward the later weeks of the year. This suggests the sample contains birds with a variety of seasonalities, though perhaps fall and winter birds are slightly over-represented. Note the 5 “outlier” curves which stand out above the crowd. These curves represent species which demonstrate extreme seasonality for each season, with winter being split into “early winter” and “late winter”. This is the first sign of an interesting geometric phenomenon present in this data: we expect one representative peak per season, but because the calendar year artificially splits the winter months, the corresponding winter peak is divided into two. The Ferruginous hawk’s seasonality exemplifies this: it has a global maximum during the early weeks, but experiences another local maximum toward the end of the year. If we could recognize the underlying geometry on which these curves lie, a kernel density estimate would join these two peaks into one! We will investigate this geometry further using functional principal component analysis, which will also perform dimension reduction on the curves and facilitate comparisons in seasonality patterns across species.
While a curve-based representation of seasonality is intuitive and easy to read for an individual species, understanding and comparing the trends of several species at a time is more difficult, as evidenced by the crowded figure displayed above. This difficulty can be attributed to the high-dimensional nature of curves: an infinite number of individual data points make up a single curve observation. Variation in each of these dimensions can come together to form major trends of variation which have meaninful interpretations, or can be minor noise more analogous to random noise. To understand the difference and identify the main patterns of variation in the curves, functional principal component analysis (FPCA) can afford us a low dimensional representation of the curves based on the Karhunen-Lo`eve decomposition for stochastic processes. We describe the technical machinery in the appendix and display the results of FPCA here.
The first step in FPCA is to estimate the covariance surface for the sample of curves. The undulating shape of the surface indicates that prevalence is normally negatively correlated between its peak time and other seasons. The only exception is near the ends of the time domain, caused by the break in the winter season. Here we see the covariance rises back up to positive values, which suggests that January counts are associated with December counts, though this association is not as strong as that seen within immediately neighboring time points. An intuitive way of visualizing this observation is to consider tesselating the surface:the edges align from across copies of the surface! Using the functional analogue of spectral decomposition we can obtain the eigenfunctions of this surface and their corresponding scores for each individual species. The results are displayed below, using a 2-dimensional truncation which captures 90% of the original variation:
The first eigenfunction represents a contrast between late spring and early winter: species with high FPC1 scores will be more common in early winter and less prevalent in spring. The second eigenfunction follows a similar shape but is shifted in time: species with high FPC2 score will be more prevalent during summer and less common during late winter. In FPC space, we can group species together much more clearly than we could using only their curve representations. For example, the Oregon and Dark-Eyed Juncos exhibit very similar patterns of seasonality: with high FPC2 scores we can understand that these birds exhibit summer seasonality.
We recognize the species on the edge of the point clouds as the species which had the outlying curves mentioned before. They represent archetypal trends for each season. Summer is in the upper left corner, while winter is in the lower left. Here we really see the underlying geometry in action: the flow of time corresponds to a clockwise rotation through FPC space. Starting at the bottom with the Ferruginous hawk and sweeping a radius clockwisely, we pass through all of the seasons before returning back to winter. Another striking observation is that the point cloud forms a ring, mimicking the circular interpretation of seasonality. This suggests that the birds in our sample are highly seasonal with only a few showing constant levels of prevalence throughout the year. We can also consider representing the original set of seasonality curves on the surface of a cylinder, instead of a flat plane, in order to mirror this latent geometry and avoid the artificial split created by the calendar year.
We consider a generic seasonality curve \(f(t), ~t\in\mathcal{T}=[1,52]\) with mean curve \(\mu(t) = E(f(t))\) and covariance surface \(G(s,t)=\text{Cov}(f(s),f(t))\), which has eigenfunctions \(\varphi_1(t),\varphi_2(t),\dots\). The Karhunen–Lo`eve representation theorem states that \[\begin{equation} f(t) = \mu(t)+\sum_{k=1}^\infty \xi_k \varphi_k(t), \label{eq:KL} \end{equation}\] where the scores \(\xi_k = \int_\mathcal{T} (C(t) - \mu(t)) \varphi_k(t) dt\) satisfy \(E(\xi_k) = 0\), \(\text{Var}( \xi_k) = \lambda_k\) and \(E(\xi_k \xi_l) = 0\) for \(k \neq l\). Here \(\xi_k\) is the functional principal component score (FPC) of \(X(\cdot)\) associated with the \(k^{\text{th}}\) eigenfunction \(\varphi_k\). Thus, the FPC scores are projections of the centered stochastic process onto the directions given by the eigenfunctions and summarize how a function changes from the mean curve along the principal modes of variations. Moreover, the centered process is equivalent to the infinite-dimensional vector \(\xi_1, \xi_2, \dots\).
By truncating the vector representation to a finite number of \(K\) components one achieves dimension reduction and can approximate the original stochastic process through its most important modes of variations. That is, FPCA provides a fit of the original curve \[\begin{equation} \hat{f}(t) = \hat{\mu}(t)+\sum_{k=1}^K \hat{\xi}_k \hat{\varphi}_k(t) \end{equation}\]